08. Calculating Correlation
Question:
import pandas as pd

filename = '/datasets/ud170/subway/nyc_subway_weather.csv'
subway_df = pd.read_csv(filename)

def correlation(x, y):
    '''
    Fill in this function to compute the correlation between the two
    input variables. Each input is either a NumPy array or a Pandas
    Series.

    correlation = average of (x in standard units) times (y in standard units)

    Remember to pass the argument "ddof=0" to the Pandas std() function!
    '''
    # Placeholder: replace with your implementation.
    return None

entries = subway_df['ENTRIESn_hourly']
cum_entries = subway_df['ENTRIESn']
rain = subway_df['meanprecipi']
temp = subway_df['meantempi']

# print() with a single argument works in both Python 2 and Python 3.
print(correlation(entries, rain))
print(correlation(entries, temp))
print(correlation(rain, temp))
print(correlation(entries, cum_entries))
Solution:
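One possible way to fill in the function is sketched below. It follows the formula from the docstring (convert both inputs to standard units, multiply them element by element, and take the average), and it is not necessarily the course's official solution. The small series a, b, and c are made-up illustration data, not values from the subway dataset.

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Convert each input to standard units: (value - mean) / std.
    # ddof=0 is passed explicitly because Pandas defaults to ddof=1.
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    # Pearson's r is the average of the element-wise products.
    return (std_x * std_y).mean()

# Made-up illustration data (an assumption, not from the dataset).
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
b = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])   # rises exactly with a
c = pd.Series([5.0, 4.0, 3.0, 2.0, 1.0])    # falls exactly as a rises

r_ab = correlation(a, b)   # close to +1
r_ac = correlation(a, c)   # close to -1
print(r_ab)
print(r_ac)

# The same function also accepts NumPy arrays.
r_arrays = correlation(np.array([1.0, 2.0, 3.0]), np.array([1.0, 3.0, 2.0]))
print(r_arrays)
```

Because the function only uses mean(), std(ddof=0), and element-wise arithmetic, it works unchanged on either input type named in the docstring.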
INSTRUCTOR NOTE:
Understanding and Interpreting Correlations
- This page contains some scatterplots of variables with different values of correlation.
- This page lets you use a slider to change the correlation and see how the data might look.
- Pearson's r only measures linear correlation! This image shows some different linear and non-linear relationships and what Pearson's r will be for those relationships.
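To make the last point concrete, here is a small sketch comparing a perfectly linear relationship with a perfectly predictable but non-linear one. The data is synthetic and chosen only for illustration: x is spread symmetrically around zero, so the quadratic relationship y = x^2 has no overall linear trend.

```python
import numpy as np

# Synthetic x values, symmetric around zero (an illustrative assumption).
x = np.linspace(-1.0, 1.0, 201)

y_linear = 3.0 * x + 2.0   # exact linear relationship
y_quadratic = x ** 2       # exact, but non-linear, relationship

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is Pearson's r.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(r_linear)      # close to 1
print(r_quadratic)   # close to 0, even though y is fully determined by x
```

A value of r near zero therefore does not mean the variables are unrelated, only that there is no linear trend between them.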
Corrected vs. Uncorrected Standard Deviation
By default, Pandas' std() function computes the standard deviation using Bessel's correction, which divides the sum of squared deviations by n - 1 instead of n. Calling std(ddof=0) turns the correction off, so the sum is divided by n.
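The difference is easy to see on a small example. The sketch below uses made-up values and compares the Pandas default, the Pandas result with ddof=0, and NumPy's std(), whose default is ddof=0.

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5 and sum of squared deviations 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

corrected = values.std()            # Pandas default, ddof=1: sqrt(32 / 7)
uncorrected = values.std(ddof=0)    # no Bessel's correction: sqrt(32 / 8)
numpy_default = np.std(np.asarray(values))  # NumPy defaults to ddof=0

print(corrected)      # roughly 2.14
print(uncorrected)    # roughly 2.0
print(numpy_default)  # matches the uncorrected Pandas result
```

This is why the same data can give slightly different standard deviations in Pandas and NumPy unless ddof is set explicitly.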
Previous Exercise
The exercise where you used a simple heuristic to estimate correlation was the "Pandas Series" exercise in the previous lesson, "NumPy and Pandas for 1D Data".
Pearson's r in NumPy
NumPy's corrcoef() function can be used to calculate Pearson's r, also known as the correlation coefficient.
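As a brief illustration, the sketch below calls corrcoef() on two made-up arrays and checks the result against the standard-units formula from the exercise. Note that corrcoef() returns a 2x2 correlation matrix rather than a single number, so Pearson's r for the pair is read from the off-diagonal entry.

```python
import numpy as np

# Made-up illustration data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Diagonal entries are 1 (each variable with itself);
# the off-diagonal entries hold Pearson's r for x and y.
matrix = np.corrcoef(x, y)
r = matrix[0, 1]

# Same value from the standard-units formula (NumPy's std() uses ddof=0).
manual = (((x - x.mean()) / x.std()) * ((y - y.mean()) / y.std())).mean()

print(matrix)
print(r)        # about 0.8 for this data
print(manual)   # agrees with r up to floating-point error
```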